{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "k_PrSopOmOkf"
   },
   "source": [
    "# Sentiment analysis using Ktrain\n",
    "In this section, we will learn how to perform sentiment analysis using ktrain. We will use the Amazon product reviews dataset, which can be downloaded from http://jmcauley.ucsd.edu/data/amazon/.\n",
    "\n",
    "The site offers the complete review data as well as smaller subsets. In this exercise, we will use the subset containing the reviews of digital music, available at http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Digital_Music_5.json.gz. The file is gzip-compressed, so after downloading we uncompress it to obtain the reviews in JSON format, with one review per line. \n",
    "\n",
    "Let us get started. First, install ktrain and import the necessary libraries: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true,
    "id": "vxg6jlpGmXAZ"
   },
   "outputs": [],
   "source": [
    "%%capture\n",
    "!pip install ktrain==0.25.3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true,
    "id": "oxPSssDFmOky"
   },
   "outputs": [],
   "source": [
    "import ktrain\n",
    "from ktrain import text\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "LwU54ZP6mOk0"
   },
   "source": [
    "Next, we download an uncompressed copy of the digital music reviews using gdown and load it into a pandas data frame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "t3G-RsPtnSKB",
    "outputId": "e1c9ebe1-8c23-4702-dbae-f3dd57620cc8"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading...\n",
      "From: https://drive.google.com/uc?id=1-8urBLVtFuuvAVHi0s000e7r0KPUgt9f\n",
      "To: /content/reviews_Digital_Music_5.json\n",
      "89.0MB [00:00, 113MB/s] \n"
     ]
    }
   ],
   "source": [
    "!gdown https://drive.google.com/uc?id=1-8urBLVtFuuvAVHi0s000e7r0KPUgt9f"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true,
    "id": "Z0qvYMmnmOk1"
   },
   "outputs": [],
   "source": [
    "df = pd.read_json(r'reviews_Digital_Music_5.json',lines=True)"
   ]
  },
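  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we download the gzip file from the SNAP URL instead, we do not have to uncompress it by hand. The following is a minimal sketch using only the standard library. The helper name read_json_lines_gz is hypothetical, and the sketch assumes the file holds one JSON object per line, which is also what lines=True assumes in the cell above:\n",
    "\n",
    "```python\n",
    "import gzip\n",
    "import json\n",
    "\n",
    "def read_json_lines_gz(path):\n",
    "    # Hypothetical helper: parse a gzip file holding one JSON object per line\n",
    "    records = []\n",
    "    with gzip.open(path, 'rt', encoding='utf-8') as f:\n",
    "        for line in f:\n",
    "            line = line.strip()\n",
    "            if line:  # skip blank lines\n",
    "                records.append(json.loads(line))\n",
    "    return records\n",
    "```\n",
    "\n",
    "The resulting list of dictionaries can be passed to pd.DataFrame to obtain a data frame like the one loaded above."
   ]
  },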
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "m8_ctGTCmOk1"
   },
   "source": [
    "\n",
    "Let us have a look at a few rows of our dataset: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 204
    },
    "id": "U4_MIW9xmOk2",
    "outputId": "53d702d2-294e-4a9c-e6a4-b926f092ec85"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>reviewerID</th>\n",
       "      <th>asin</th>\n",
       "      <th>reviewerName</th>\n",
       "      <th>helpful</th>\n",
       "      <th>reviewText</th>\n",
       "      <th>overall</th>\n",
       "      <th>summary</th>\n",
       "      <th>unixReviewTime</th>\n",
       "      <th>reviewTime</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A3EBHHCZO6V2A4</td>\n",
       "      <td>5555991584</td>\n",
       "      <td>Amaranth \"music fan\"</td>\n",
       "      <td>[3, 3]</td>\n",
       "      <td>It's hard to believe \"Memory of Trees\" came ou...</td>\n",
       "      <td>5</td>\n",
       "      <td>Enya's last great album</td>\n",
       "      <td>1158019200</td>\n",
       "      <td>09 12, 2006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AZPWAXJG9OJXV</td>\n",
       "      <td>5555991584</td>\n",
       "      <td>bethtexas</td>\n",
       "      <td>[0, 0]</td>\n",
       "      <td>A clasically-styled and introverted album, Mem...</td>\n",
       "      <td>5</td>\n",
       "      <td>Enya at her most elegant</td>\n",
       "      <td>991526400</td>\n",
       "      <td>06 3, 2001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A38IRL0X2T4DPF</td>\n",
       "      <td>5555991584</td>\n",
       "      <td>bob turnley</td>\n",
       "      <td>[2, 2]</td>\n",
       "      <td>I never thought Enya would reach the sublime h...</td>\n",
       "      <td>5</td>\n",
       "      <td>The best so far</td>\n",
       "      <td>1058140800</td>\n",
       "      <td>07 14, 2003</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>A22IK3I6U76GX0</td>\n",
       "      <td>5555991584</td>\n",
       "      <td>Calle</td>\n",
       "      <td>[1, 1]</td>\n",
       "      <td>This is the third review of an irish album I w...</td>\n",
       "      <td>5</td>\n",
       "      <td>Ireland produces good music.</td>\n",
       "      <td>957312000</td>\n",
       "      <td>05 3, 2000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>A1AISPOIIHTHXX</td>\n",
       "      <td>5555991584</td>\n",
       "      <td>Cloud \"...\"</td>\n",
       "      <td>[1, 1]</td>\n",
       "      <td>Enya, despite being a successful recording art...</td>\n",
       "      <td>4</td>\n",
       "      <td>4.5; music to dream to</td>\n",
       "      <td>1200528000</td>\n",
       "      <td>01 17, 2008</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       reviewerID        asin  ... unixReviewTime   reviewTime\n",
       "0  A3EBHHCZO6V2A4  5555991584  ...     1158019200  09 12, 2006\n",
       "1   AZPWAXJG9OJXV  5555991584  ...      991526400   06 3, 2001\n",
       "2  A38IRL0X2T4DPF  5555991584  ...     1058140800  07 14, 2003\n",
       "3  A22IK3I6U76GX0  5555991584  ...      957312000   05 3, 2000\n",
       "4  A1AISPOIIHTHXX  5555991584  ...     1200528000  01 17, 2008\n",
       "\n",
       "[5 rows x 9 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ID-DsdnWmOk3"
   },
   "source": [
    "\n",
    "We only need the review text and the overall rating, so let us keep only the reviewText and overall columns, as shown below:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true,
    "id": "rPh3z4iWmOk3"
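   },
   "outputs": [],
   "source": [
    "# copy the subset so that adding a column later does not trigger SettingWithCopyWarning\n",
    "df = df[['reviewText','overall']].copy()"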
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 204
    },
    "id": "TT1r2wWAmOk4",
    "outputId": "fb3b08cd-c872-4836-ce2b-d428119c6dff"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>reviewText</th>\n",
       "      <th>overall</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>It's hard to believe \"Memory of Trees\" came ou...</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A clasically-styled and introverted album, Mem...</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>I never thought Enya would reach the sublime h...</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>This is the third review of an irish album I w...</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Enya, despite being a successful recording art...</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                          reviewText  overall\n",
       "0  It's hard to believe \"Memory of Trees\" came ou...        5\n",
       "1  A clasically-styled and introverted album, Mem...        5\n",
       "2  I never thought Enya would reach the sublime h...        5\n",
       "3  This is the third review of an irish album I w...        5\n",
       "4  Enya, despite being a successful recording art...        4"
      ]
     },
     "execution_count": 7,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kw90L_humOk4"
   },
   "source": [
    "\n",
    "The ratings range from 1 to 5. Let us convert these ratings to sentiment by mapping ratings 1 to 3 to the negative class and ratings 4 and 5 to the positive class:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true,
    "id": "ME-HKMqgmOk5"
   },
   "outputs": [],
   "source": [
    "sentiment = {1:'negative',2:'negative',3:'negative',\n",
    "             4:'positive',5:'positive'}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true,
    "id": "7BHZX-zNmOk5"
   },
   "outputs": [],
   "source": [
    "\n",
    "df['sentiment'] = df['overall'].map(sentiment)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8LfZ1aY8mOk6"
   },
   "source": [
    "\n",
    "Now, let us keep only the reviewText and sentiment columns, as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true,
    "id": "mU6YAVsgmOk6"
   },
   "outputs": [],
   "source": [
    "df = df[['reviewText','sentiment']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 204
    },
    "id": "qTkz7jAxmOk6",
    "outputId": "fe7f7fb5-6f22-4ce5-f2af-afb802fcc04d"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>reviewText</th>\n",
       "      <th>sentiment</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>It's hard to believe \"Memory of Trees\" came ou...</td>\n",
       "      <td>positive</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A clasically-styled and introverted album, Mem...</td>\n",
       "      <td>positive</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>I never thought Enya would reach the sublime h...</td>\n",
       "      <td>positive</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>This is the third review of an irish album I w...</td>\n",
       "      <td>positive</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Enya, despite being a successful recording art...</td>\n",
       "      <td>positive</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                          reviewText sentiment\n",
       "0  It's hard to believe \"Memory of Trees\" came ou...  positive\n",
       "1  A clasically-styled and introverted album, Mem...  positive\n",
       "2  I never thought Enya would reach the sublime h...  positive\n",
       "3  This is the third review of an irish album I w...  positive\n",
       "4  Enya, despite being a successful recording art...  positive"
      ]
     },
     "execution_count": 11,
     "metadata": {
      "tags": []
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "og9gT_vfmOk7"
   },
   "source": [
    "\n",
    "\n",
    "From the above result, we can see that each review text is now paired with its sentiment. \n",
    "\n",
    "The next step is to create the train and test sets. If our data is in a pandas data frame, we can use the texts_from_df function, and if our data is in a folder, we can use the texts_from_folder function. \n",
    "\n",
    "Since our dataset is in a pandas data frame, we will use the texts_from_df function. Its arguments include the following: \n",
    "\n",
    "- train_df - the data frame containing the reviews and their sentiment\n",
    "- text_column - the name of the column containing the reviews\n",
    "- label_columns - a list with the name of the column (or columns) containing the label \n",
    "- maxlen - the maximum length of a review in tokens \n",
    "- max_features - the maximum number of words to use in the vocabulary \n",
    "- preprocess_mode - the type of text preprocessing. For normal tokenization, we set preprocess_mode to standard; to tokenize the text as BERT does, we set it to bert \n",
    "- val_pct - the fraction of the data held out as the test (validation) set \n",
    "\n",
    "\n",
    "In this exercise, we will set maxlen to 100, max_features to 100000, and val_pct to 0.1. We use bert as the preprocess_mode since we are going to use the BERT model for performing classification: \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 136
    },
    "id": "cFHL5Y08mOk7",
    "outputId": "70035bbd-ee52-4c59-bd7d-92af96364f05"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "preprocessing train...\n",
      "language: en\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "done."
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {
      "tags": []
     },
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Is Multi-Label? False\n",
      "preprocessing test...\n",
      "language: en\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "done."
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {
      "tags": []
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "(x_train, y_train), (x_test, y_test), preproc = text.texts_from_df(train_df = df, \n",
    "                                                                   text_column = 'reviewText',\n",
    "                                                                   label_columns=['sentiment'],\n",
    "                                                                   maxlen=100, \n",
    "                                                                   max_features=100000,\n",
    "                                                                   preprocess_mode='bert',\n",
    "                                                                   val_pct=0.1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "R6iz_0AZmOk8"
   },
   "source": [
    "\n",
    "ktrain provides a diverse set of classifiers, ranging from logistic regression and bidirectional GRU to the BERT model, and we can list them by calling text.print_text_classifiers(). In this tutorial, we will use the BERT model. \n"
   ]
  },
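  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To build some intuition for the maxlen argument passed to texts_from_df, the following is a minimal sketch of fixing a token sequence to a given length. The function name and the pad token are hypothetical, and the sketch assumes plain whitespace tokenization, which is a simplification; BERT preprocessing works on WordPiece tokens and adds special tokens:\n",
    "\n",
    "```python\n",
    "def fix_length(text, maxlen, pad_token='[PAD]'):\n",
    "    # Simplified illustration: whitespace tokens instead of WordPiece tokens\n",
    "    tokens = text.split()\n",
    "    tokens = tokens[:maxlen]  # truncate long reviews\n",
    "    tokens += [pad_token] * (maxlen - len(tokens))  # pad short reviews\n",
    "    return tokens\n",
    "\n",
    "print(fix_length('I loved the song', 6))\n",
    "print(fix_length('one of the best albums I have ever heard', 6))\n",
    "```\n",
    "\n",
    "Both calls return exactly six tokens: the short review is padded and the long one is cut off. This is why a small maxlen speeds up training but can discard the end of long reviews."
   ]
  },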
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BGqB3RcZmOk8"
   },
   "source": [
    "\n",
    "Now, let's define the model using the text_classifier function, which builds and returns a classifier. The following are the important arguments to the function:\n",
    "\n",
    "\n",
    "- name - the name of the model we want to use; in this case, we will use bert\n",
    "- train_data - a tuple containing our training data, which is (x_train, y_train)\n",
    "- preproc - the instance of our preprocessor \n",
    "- metrics - the metrics with which we want to assess the performance of our model; in this example, we will use accuracy\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "ZmJ3rOs8mOk9",
    "outputId": "0734aff5-a5f6-4e3b-e17e-f9d23218d683"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Is Multi-Label? False\n",
      "maxlen is 100\n",
      "done.\n"
     ]
    }
   ],
   "source": [
    "model = text.text_classifier(name='bert', train_data = (x_train, y_train) , preproc=preproc, metrics=['accuracy'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pH-45invmOk9"
   },
   "source": [
    "\n",
    "Next, we create an instance called learner, which is used for training our model. We use the get_learner function to create the learner instance. The following are the important arguments to the function: \n",
    "\n",
    "- model - the model we defined in the previous step \n",
    "- train_data - a tuple containing our training data \n",
    "- val_data - a tuple containing our test data \n",
    "- batch_size - the batch size we want to use \n",
    "- use_multiprocessing - a boolean value indicating whether we want to use multiprocessing \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true,
    "id": "ZO7O-260mOk-"
   },
   "outputs": [],
   "source": [
    "learner = ktrain.get_learner(model = model, \n",
    "                             train_data=(x_train, y_train), \n",
    "                             val_data=(x_test, y_test), \n",
    "                             batch_size=32, \n",
    "                             use_multiprocessing = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xOuuKG_8mOk-"
   },
   "source": [
    "\n",
    "Now, we can finally train the model using the fit_onecycle function. The following are the important arguments to the function: \n",
    "\n",
    "- lr - the maximum learning rate \n",
    "- epochs - the number of epochs we want to train for \n",
    "- checkpoint_folder - the directory where we want to store the model weights"
   ]
  },
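  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The fit_onecycle function follows the 1cycle learning rate policy, in which the learning rate first increases to the maximum value lr and then decreases again. The following is a minimal sketch of such a schedule with a simple triangular shape. The function name and the starting rate of max_lr / 10 are assumptions made for illustration and are not ktrain's exact implementation:\n",
    "\n",
    "```python\n",
    "def one_cycle_lr(step, total_steps, max_lr, div=10.0):\n",
    "    # Illustrative triangular schedule: ramp up in the first half, ramp down in the second\n",
    "    min_lr = max_lr / div\n",
    "    half = total_steps / 2\n",
    "    if step <= half:\n",
    "        frac = step / half\n",
    "    else:\n",
    "        frac = (total_steps - step) / half\n",
    "    return min_lr + frac * (max_lr - min_lr)\n",
    "\n",
    "for step in (0, 25, 50, 75, 100):\n",
    "    print(step, one_cycle_lr(step, 100, 2e-5))\n",
    "```\n",
    "\n",
    "With 100 steps, the rate starts at a tenth of the maximum, peaks at step 50, and returns to the starting value at step 100."
   ]
  },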
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true,
    "id": "0YIpdqOOmOk-"
   },
   "outputs": [],
   "source": [
    "learner.fit_onecycle(lr=2e-5, epochs=1,checkpoint_folder='output')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "obWkpQ5qmOk_"
   },
   "source": [
    "\n",
    "In this example, we train for only one epoch for simplicity. During training, the above code reports the loss and accuracy on the training and test sets.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "t9pv8hBKmOk_"
   },
   "source": [
    "\n",
    "In our run, we obtained 87% accuracy on the test set. That's it. Training a model using ktrain is this simple. \n",
    "\n",
    "Now, we can use the trained model and make predictions using the get_predictor function. We need to pass our trained model and the instance of our preprocessor: \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "id": "vIv4LD89mOlA"
   },
   "outputs": [],
   "source": [
    "predictor = ktrain.get_predictor(learner.model, preproc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OdXsbPgFmOlA"
   },
   "source": [
    "\n",
    "Next, we can make a prediction with the predict function by passing the text: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "id": "DPbePspOmOlB"
   },
   "outputs": [],
   "source": [
    "predictor.predict('I loved the song')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "M9-P3wt2mOlB"
   },
   "source": [
    "For the given text, our model predicts the positive class. "
   ]
  }
 ],
 "metadata": {
  "colab": {
   "name": "9.07. Sentiment analysis using Ktrain.ipynb",
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
